simulation <- list(
  model = "calibrated_model",
  replications = 2,
  tbl_parameter = expand.grid(
    network = c("neutral", "partisan", "antipartisan", "mixed")
  ),
  master_seed = 42
)
results <- run_simulation(simulation)

Last week, we focused on calibration, allowing us to fine-tune our model parameters to closely match empirical data. With a calibrated model in hand, we can now run experiments in silico, simulating experimental conditions in a virtual environment. This step enables us to explore key design questions, refine our experimental setup, and improve the planning of real-world studies.
Consider the question of how people’s opinions are influenced by others. More specifically, we are interested in how different network topologies affect polarization within a population.1 We assume that the influence of a neutral source differs from the influence of someone with shared partisanship, and also from someone with opposing partisanship. This sets the stage for an in silico experiment. Figure 1 shows four different experimental treatments that participants might encounter when learning about the opinions of others:
Each treatment reflects a different network topology that a focal agent (ego, the red circle) is embedded in. The experiment proceeds as follows: first, participants’ initial opinions are recorded individually. Then, they are informed about the opinions of their neighbors, after which their opinions are measured again. In the first (neutral) condition, ego learns only the opinions of four others, without knowing their partisanship. In the other three conditions, ego also learns their neighbors’ party preferences—either all similar (partisan), all opposing (antipartisan), or a balanced mix (mixed).
We assume that neutral influence (\(\mu\)) is weaker than influence from like-minded individuals (\(\mu_P\)), while influence from those with opposing partisanship (\(\mu_A\)) is weaker still—or even negative. In short, we expect \(\mu_A < \mu < \mu_P\). Drawing on empirical findings from previous studies, we use calibrated values for these parameters (calibrated_model) to run our experiment in silico:
To analyze the simulation output, we compute polarization within each simulated world. For this, we calculate the average absolute opinion difference between the two partisan groups. The use of absolute differences reflects our focus on the magnitude of disagreement, not which group supports which view.
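The calc_polarization() helper (sourced below) implements this across all simulated worlds. As an illustration of the underlying computation for a single world, here is a minimal sketch; the column names `party` and `opinion` and the toy data are assumptions for this example only:

```r
# Sketch: polarization of a single simulated world, defined as the
# absolute difference between the two partisan groups' mean opinions.
# (Illustrative stand-in for the course's calc_polarization().)
polarization_one_world <- function(world) {
  group_means <- tapply(world$opinion, world$party, mean)
  abs(group_means[["A"]] - group_means[["B"]])
}

toy_world <- data.frame(
  party   = c("A", "A", "B", "B"),
  opinion = c(0.2, 0.4, -0.1, -0.3)
)
polarization_one_world(toy_world)  # |0.3 - (-0.2)| = 0.5
```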
# load the functions for calculating polarization
source("../data/abm-opinion3/functions_analysis.R")
df_polarization <- calc_polarization(results)
# show average differences between participants in the worlds
df_polarization |>
  group_by(network) |>
  summarize(
    abs_diff_mean = mean(abs_diff_opinion_1)
  )

# A tibble: 4 × 2
network abs_diff_mean
<fct> <dbl>
1 neutral 0.0870
2 partisan 0.124
3 antipartisan 0.168
4 mixed 0.149
The output reveals that average polarization is lowest in the neutral condition and highest in the antipartisan condition. A single round of interaction is enough to raise polarization by more than 8 percentage points when moving from the neutral to the antipartisan condition.
We can also assess the effect size of these differences (using the library effsize):
library(effsize)
d <- cohen.d(
  formula = abs_diff_opinion_1 ~ network,
  data = df_polarization |>
    filter(network %in% c("neutral", "antipartisan")) |>
    droplevels(),
  pooled = TRUE,
  hedges.correction = FALSE
)
d
Cohen's d
d estimate: -0.8140591 (large)
95 percent confidence interval:
lower upper
-5.291375 3.663257
The result suggests a large effect, although the confidence interval is extremely wide. But is that enough to justify a real-world experiment based on these findings? While the simulated effects are promising, we need to know whether such differences would be statistically significant with real experimental data. A common approach is to run the same statistical test we would later apply to the real experiment to check whether polarization differs between treatments. Let's assume we are content with a simple t-test and use it to compare the neutral and antipartisan conditions:
ttest <- df_polarization |>
  filter(network %in% c("neutral", "antipartisan")) |>
  t.test(abs_diff_opinion_1 ~ network, data = _)
ttest
Welch Two Sample t-test
data: abs_diff_opinion_1 by network
t = -0.81406, df = 1.4419, p-value = 0.5281
alternative hypothesis: true difference in means between group neutral and group antipartisan is not equal to 0
95 percent confidence interval:
-0.7159995 0.5536910
sample estimates:
mean in group neutral mean in group antipartisan
0.0869696 0.1681238
Yet, the difference between the two treatments is not statistically significant (\(p = 0.528\))—even though the effect size is considerable. Why? Because we only have two worlds per condition. Remember: polarization is a property of the entire simulated world, not of individuals. So even if each world contains many agents, our actual sample size for the statistical test is just two worlds per treatment.
Clearly, this sample size is far too small to detect substantial effects with confidence. Had we proceeded with a real-world experiment based on such limited data, we might have wasted significant resources. To prevent this, we turn to power analysis. Power analysis allows us to estimate how many simulated worlds (or real-world participants) are needed to detect effects of a given size with a high probability. This ensures that the experiments we design are not only theoretically sound but also practically effective.
At the same time, as you may already know from earlier statistics courses, increasing the sample size sufficiently will always lead to statistical significance, regardless of effect size. Therefore, we aim for a sample size that is just sufficient to detect the effect sizes we believe to be realistic—while remaining insensitive to much smaller, potentially spurious effects. In other words, power analysis helps us find the minimum number of worlds (or more generally sample size) required to robustly detect meaningful differences, without overcommitting resources or overfitting to noise.
Learning Goals
- Define and interpret statistical power.
- Identify the main components affecting experimental power.
- Calculate power using both analytic formulas and simulation methods.
- Conduct sensitivity analyses to assess how robust your power calculations are to assumptions.
Statistical Power
Statistical power (henceforth simply power) is the probability that an experiment correctly identifies a true effect—it’s our ability to detect a meaningful signal amidst the noise. In any experimental setting, the signal we seek is the actual impact of a treatment. This could be:
- Does an educational intervention increase income levels?
- Do vaccination campaigns reduce disease incidence?
- Which network topologies affect partisan polarization the most?
The noise refers to the natural variability or randomness in data, caused by many unpredictable factors that are difficult—or even impossible—to measure or control for statistically. It’s typically quantified as the standard deviation of outcomes. Consider the following examples:
- If your outcome is the prevalence of a rare disease, daily fluctuations are usually minimal—thus, noise is low. Even a modest decrease (e.g., 1 percentage point) in disease rates could be detectable.
- Conversely, measuring income as an outcome is inherently noisier, as individual incomes can differ greatly. A modest income increase of 1% might easily go unnoticed due to substantial background noise.
Adequate power ensures that a true effect—if it exists—won’t be masked by randomness.
Power and Specificity
When planning an experiment, two core statistical concepts must be carefully balanced: power and specificity (see Table 1). Understanding how they interact is crucial for designing experiments that are both scientifically rigorous and resource-efficient.
Specificity refers to the probability of correctly retaining the null hypothesis when it is actually true. In other words, it is the ability of a test to avoid false positives. The significance level \(\alpha\) defines how willing we are to risk a Type I error—rejecting the null hypothesis \(H_0\) when it is actually true. This is often set at 0.05, meaning we accept a 5% chance of a false positive. Accordingly, the specificity of such a test is \(1 - \alpha = 0.95\).
On the other hand, statistical power—denoted as \(1 - \beta\)—is the probability of correctly detecting a true effect. It represents the ability to avoid a Type II error (failing to reject \(H_0\) when it is false). The value \(\beta\) quantifies the probability of making such an error. Power is conventionally set at 80%, meaning we aim for \(1 - \beta = 0.8\), or in other words, we want to correctly detect true effects in 4 out of 5 experiments.
Balancing power and specificity is essential: increasing one often comes at the cost of the other. Power analysis helps us navigate this trade-off, ensuring that we choose a sample size large enough to detect meaningful effects, without inflating the risk of false positives.
|  |  | \(H_0\) is true | \(H_1\) is true |
|---|---|---|---|
| Test decision… | … in favor of \(H_0\) | ✅ Correct decision (Specificity)<br>Probability: 1 − α | ❌ Type II error<br>Probability: β |
|  | … in favor of \(H_1\) | ❌ Type I error<br>Probability: α | ✅ Correct decision (Power)<br>Probability: 1 − β |
Four Core Elements of Statistical Power
Four primary factors shape an experiment’s power:
The Test Itself and the Statistical Significance Criterion

The type of statistical test used—and the threshold set for statistical significance—directly influences power. Some tests are inherently more sensitive than others, depending on the data and research design. In addition, the choice of \(\alpha\) (commonly set at 0.05) defines the strictness of evidence required to reject the null hypothesis. A lower \(\alpha\) (e.g., 0.01) reduces the chance of false positives but also makes it harder to detect true effects, thereby reducing power. Conversely, increasing \(\alpha\) boosts power but raises the risk of false positives.

Strength of Treatment

Stronger treatments are easier to detect and therefore increase statistical power. For example, giving each participant a large sum of money is more likely to produce a measurable effect than giving only a small sum. However, in practice, researchers often face constraints—whether due to limited resources or limited control over the intervention—especially when evaluating existing programs or policies.

Background Noise (Variability)

The more variability there is in outcomes, the harder it is to detect a true effect. Selecting outcome measures with lower variability can help improve power. In many cases, however, variability is inherent and difficult to reduce—especially when measuring complex behaviors like income or attitudes. This is precisely one of the reasons researchers turn to lab experiments: to create controlled environments that minimize noise. By standardizing conditions across participants, lab settings help keep variability low, making it easier to isolate and detect the effect of interest.

Experimental Design

This is (often) the most direct lever researchers have to influence power. Power analysis often focuses on sample size—larger samples reduce uncertainty and increase power. But good design goes beyond just size. Consider:

- How units are randomized (individually or in clusters),
- Whether key covariates are measured and controlled for,
- How many treatment groups are included and how they’re structured.
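The first of these points—the \(\alpha\) criterion—is easy to quantify with base R's power.t.test(): holding effect size, noise, and sample size fixed, a stricter \(\alpha\) lowers power. The numbers below are illustrative, not taken from our simulation:

```r
# Same design under two significance criteria: stricter alpha -> lower power.
# (Illustrative values: 25 units per group, mean difference 0.08, sd 0.10.)
p05 <- power.t.test(n = 25, delta = 0.08, sd = 0.10, sig.level = 0.05)$power
p01 <- power.t.test(n = 25, delta = 0.08, sd = 0.10, sig.level = 0.01)$power
round(c(alpha_05 = p05, alpha_01 = p01), 2)
```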
Time To Play (and Learn!)
In the interactive graph below, you can explore how the key parameters of hypothesis testing—significance criterion \(\alpha\), difference in mean \(\mu_1\), and standard deviation \(\sigma\)—interact to shape the probabilities of different outcomes.2 By adjusting the sliders, observe how the rates of Type I errors (false positives) and Type II errors (false negatives) change.
- The blue shaded areas represent Type I errors: regions where we incorrectly reject the null hypothesis (\(H_0\) is true, but we reject it).
- The red shaded area represents a Type II error: failing to reject the null hypothesis when the alternative is actually true.
- The remaining areas under the curves represent correct decisions: either correctly retaining the null or correctly detecting a true effect.
In an ideal scenario, we aim to minimize both the blue and red areas—reducing the likelihood of false discoveries and missed effects. Try adjusting the parameters to see how you can shrink both error regions simultaneously!
(Interactive controls: significance criterion \(\alpha\), difference in mean \(\mu_1\), standard deviation \(\sigma\).)
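The quantities behind the graph can also be computed directly. Here is a sketch, assuming the test statistic is a single draw from \(N(0, \sigma)\) under \(H_0\) and from \(N(\mu_1, \sigma)\) under \(H_1\), with a two-sided criterion (a simplification of the page's own interactive code):

```r
# Two-sided rejection region under H0: |x| > crit.
# Type I error = alpha by construction; Type II error (beta) is the
# probability mass of the H1 distribution inside the acceptance region.
error_rates <- function(alpha, mu1, sigma) {
  crit <- qnorm(1 - alpha / 2, mean = 0, sd = sigma)
  beta <- pnorm(crit, mean = mu1, sd = sigma) -
    pnorm(-crit, mean = mu1, sd = sigma)
  c(type_I = alpha, type_II = beta, power = 1 - beta)
}

error_rates(alpha = 0.05, mu1 = 2, sigma = 1)
```

Lowering \(\alpha\) shrinks the blue (Type I) area but widens the acceptance region, so the red (Type II) area grows—exactly the trade-off the sliders let you explore.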
As you will notice, all three parameters—significance criterion \(\alpha\), difference in mean \(\mu_1\), and standard deviation \(\sigma\)—impact the probability of a Type II error (\(\beta\)). However, they are not equally advisable to manipulate from a scientific perspective.
While increasing the significance criterion \(\alpha\) can improve power by reducing \(\beta\), doing so comes at the cost of raising the false positive rate. This compromises the integrity of your results and increases the risk of contributing to irreproducible findings. Good scientific practice discourages simply relaxing \(\alpha\) to gain statistical significance.
Instead, maintaining a conventional threshold (e.g., \(\alpha = 0.05\)) ensures that evidence required to reject the null hypothesis remains stringent. This protects the credibility of statistical inference and helps prevent the overstatement of weak or spurious effects—one of the central concerns in the current replication crisis across many disciplines.
Therefore, we should seek to influence \(\beta\) by adjusting parameters that improve our experimental design without compromising statistical integrity—namely, increasing effect size (e.g., through better treatment design), reducing variability (e.g., by improving measurement precision), or increasing sample size (e.g., adding more participants or simulation runs).
Conducting a Power Analysis
In the previous section, we explored several key design choices—such as sample size, randomization strategy, and control of covariates—that can greatly influence an experiment’s statistical power. To make informed design decisions, however, we must first understand how to quantify power. This is where power analysis comes in.
There are two primary approaches to conducting a power analysis:
- Analytical methods, which use mathematical formulas derived from statistical theory to estimate power.
- Simulation-based methods, which involve generating synthetic datasets under specified assumptions and repeatedly testing them to observe how often a statistically significant result occurs.
Each approach has its advantages. Analytical formulas are fast, interpretable, and useful for standard designs—but they depend on idealized assumptions (e.g., normality, equal variances). Simulation, by contrast, is computationally more intensive but highly flexible, making it better suited for complex or customized scenarios.
In the sections that follow, we will explore both methods in detail and discuss how to apply them to real experimental planning.
The Analytical Way
There are many formulas for calculating statistical power, each tailored to a specific type of test (e.g., t-tests, ANOVA, regression). To keep things simple, we’ll focus on the classical two-sample, two-sided t-test, which compares the means of two independent groups.
For this case, the power (\(1 - \beta\)) can be approximated using the following formula:
\[ 1-\beta \approx 1 - \Phi\left( \Phi^{-1} \left(1-\frac{\alpha}{2}\right) - \frac{|\mu_t-\mu_c|\sqrt{N}}{2\sigma} \right) \]
Here’s what these symbols mean:
- \(\beta\): Probability of making a Type II error.
- \(\Phi\): Cumulative distribution function (CDF) for the normal distribution.
- \(\Phi^{-1}\): The inverse normal distribution function.
- \(\mu_t\): Mean outcome of the treatment group.
- \(\mu_c\): Mean outcome of the control group.
- \(\sigma\): Standard deviation (noise) of outcomes, assumed equal across groups.
- \(\alpha\): Significance level (commonly set at 0.05).
- \(N\): Total sample size.
Note that this approximation becomes more accurate as \(N\) increases, because under the alternative hypothesis (\(H_1\)), the Student’s t-distribution converges to the standard normal distribution.
Let’s apply this formula to estimate the power of our experiment with \(N = 4\) (2 simulated worlds per condition):
power_calculator <- function(mu_t, mu_c, sigma, alpha = 0.05, N, two.sided = TRUE) {
  if (!two.sided) alpha <- 2 * alpha
  inside <- qnorm(1 - alpha / 2) - abs(mu_t - mu_c) * sqrt(N) / (2 * sigma)
  1 - pnorm(inside)
}
# calculate the sd for our two experimental conditions
sigma <- df_polarization |>
  filter(network %in% c("neutral", "antipartisan")) |>
  pull(abs_diff_opinion_1) |>
  sd()
power_calculator(mu_t = 0.1681238, mu_c = 0.0869696, sigma = sigma, alpha = 0.05, N = 4)

[1] 0.1365657
We see that with \(N = 4\), we can only expect about 13.7% power—meaning that in roughly 86% of similar experiments, we would fail to detect a real effect, despite a substantial difference between groups.
How many observations do we need? For a quick approximation, we can use Lehr’s rule of thumb (1992), which estimates sample size for an 80% powered, two-sided t-test at \(\alpha = 0.05\) as:
\[ N \approx 16 \cdot \frac{\sigma^2}{(\mu_c-\mu_t)^2} \]
Let’s apply this rule in R:
approx_N <- function(mu_t, mu_c, sigma) {
  16 * sigma^2 / (mu_t - mu_c)^2
}
approx_N(
  mu_t = 0.1681238,
  mu_c = 0.0869696,
  sigma = sigma
)

[1] 21.4293
This yields an estimated \(N \geq 22\) to achieve 80% power.
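As a cross-check on Lehr's rule (not part of the original analysis), base R's power.t.test() inverts the power calculation exactly: leave n unspecified and supply the target power. The sd below is an illustrative stand-in for the \(\sigma\) estimated above, hard-coded so that the snippet is self-contained:

```r
# Solve for the per-group n that yields 80% power at alpha = 0.05.
# delta: observed difference in group means (from the output above);
# sd: illustrative stand-in for the estimated sigma (an assumption here).
res <- power.t.test(
  delta = 0.1681238 - 0.0869696,
  sd = 0.094,
  sig.level = 0.05,
  power = 0.80,
  type = "two.sample",
  alternative = "two.sided"
)
ceiling(res$n)  # required worlds per treatment
```

Unlike Lehr's normal-approximation shortcut, power.t.test() iterates on the noncentral t-distribution, so its answer is slightly larger.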
While this is helpful for a rough estimate, it’s not advisable to rely solely on such approximations. To improve accuracy, we can use the WebPower package, which offers more precise calculations and supports a range of designs, including unequal group sizes, paired samples, and more.
Let’s use WebPower to plot how power increases with \(N\), using the previously calculated effect size as an anchor.
# install.packages("WebPower")
library(WebPower)
# we calculated d earlier with cohen.d
res <- wp.t(n1 = seq(20, 30, by = 1), d = d$estimate, type = "two.sample", alternative = "two.sided")
plot(res)
abline(h = 0.8, lty = 2, col = "red")

We observe that while Lehr’s rule suggests \(N = 22\), a more precise estimate shows we would actually need at least \(N = 25\) per group to achieve 80% power in our experiment.
Trusting Your Power Analysis
Power analyses inherently rely on uncertain assumptions. Critics might call this circular: you use guesses about effect size and variability to determine how likely you are to detect those very effects.
However, power analysis is a powerful tool precisely because it reveals sensitivity to these assumptions. You can easily explore how your conclusions change with variations in assumed effect size, sample size, and noise.
For instance, vary the sample size (\(N\)) to observe how power changes:
- Larger samples increase power.
- Smaller effect sizes or higher variability reduce power.
Conducting sensitivity analyses this way helps you make informed, transparent decisions about experimental design.
It’s also worth noting that our current estimates rely on just a few simulations (e.g., \(N = 4\) total worlds for comparing neutral vs. antipartisan conditions). This low number limits the reliability of our estimates. To obtain more trustworthy results, we should next turn to simulation-based power estimation—directly measuring how often our experiment would succeed, given our current assumptions.
Simulation: A Practical Alternative to Formula-Based Power Analysis
Traditional formulas offer quick insights, but modern computational methods—such as simulations—can provide richer, more flexible analysis.
Simulation involves:
- Repeatedly generating hypothetical datasets based on your assumptions.
- Conducting statistical tests on these simulated datasets.
- Calculating the proportion of simulated experiments detecting significant effects.
This approach makes fewer assumptions and lets you explore complex scenarios realistically, especially useful for advanced or novel experimental designs.
Simulations thus enhance your understanding of an experiment’s robustness and strengthen the validity of your experimental findings.
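Before applying this recipe to the ABM output, it can be sketched in a few lines for a plain two-group design. All numbers here are illustrative, not taken from our model:

```r
# Simulation-based power for a two-sample t-test:
# generate data under H1 many times, test each dataset, count rejections.
set.seed(1)
sim_power <- function(n_per_group, delta, sigma, alpha = 0.05, n_sims = 2000) {
  p_values <- replicate(n_sims, {
    control   <- rnorm(n_per_group, mean = 0,     sd = sigma)
    treatment <- rnorm(n_per_group, mean = delta, sd = sigma)
    t.test(control, treatment)$p.value
  })
  mean(p_values < alpha)  # share of significant results = estimated power
}

sim_power(n_per_group = 25, delta = 0.08, sigma = 0.1)
```

The same three steps—generate, test, count—carry over unchanged when the "data generator" is our agent-based model instead of rnorm().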
Implementation in R
Under normal circumstances—when we’re interested in individual-level outcomes—our run_simulation() function would suffice for estimating the required \(N\). We could vary \(N\) directly in tbl_parameter and track when the treatment becomes significant in more than 80% of the replications.
However, in our case, we’re interested in a property of the world itself (i.e., polarization), not in individual behavior. In this context, the replications used in run_simulation() actually correspond to the number of independent worlds (or experimental sessions) we simulate per treatment.
A clean and robust solution would be to write a run_meta_simulation() function that wraps around run_simulation() and handles repeated simulation for each sample size, collecting p-values across runs. However, such an implementation is complex. Instead, we’ll take a shortcut and slightly modify how we use our existing run_simulation():
replications <- 100
worlds_per_treatment_max <- 40
simulation <- list(
  model = "calibrated_model",
  replications = worlds_per_treatment_max * replications,
  tbl_parameter = expand.grid(
    network = c("neutral", "antipartisan")
  ),
  master_seed = 42
)

Here we define 100 replications and 40 as the maximum number of worlds per treatment we want to simulate.
Now we run this simulation:
meta_results <- run_simulation(simulation)

The idea is as follows: replications 1 to 100 correspond to the first world, 101 to 200 to the second world, and so on. If we want to compare, for example, 10 neutral worlds with 5 antipartisan worlds, we can take replications 1, 101, 201, …, 901 and compare those to the corresponding replications for the other condition.
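The index arithmetic behind this scheme is easy to verify in isolation. Here is a tiny helper (hypothetical, for illustration only; the analysis function below inlines the same seq() call):

```r
# Row indices for replication i across the first n_worlds worlds,
# when rows are ordered world-by-world with `replications` rows each.
world_rows <- function(i, n_worlds, replications = 100) {
  seq(from = i, by = replications, length.out = n_worlds)
}

world_rows(i = 1, n_worlds = 10)  # 1 101 201 ... 901
world_rows(i = 2, n_worlds = 5)   # 2 102 202 302 402
```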
To automate this, we write a function that calculates power by looping over replications, collecting p-values, and computing the share of tests that fall below our \(\alpha\) threshold:
get_power_from_simulation <- function(
  df,
  n_1,
  n_2,
  by_var,       # unquoted column name, e.g. network
  c_1,          # value of by_var for group 1, e.g. "neutral"
  c_2,          # value of by_var for group 2, e.g. "antipartisan"
  replications,
  alpha = 0.05
) {
  by_var <- enquo(by_var)
  # Pre-filter the two condition sets once:
  df1 <- df |> filter((!!by_var) == c_1)
  df2 <- df |> filter((!!by_var) == c_2)
  stopifnot(
    nrow(df1) >= n_1 * replications,
    nrow(df2) >= n_2 * replications
  )
  rejections <- vapply(
    seq_len(replications),
    FUN.VALUE = logical(1),
    function(i) {
      # pick out the i-th replication of worlds 1..n_1 in condition 1
      vals1 <- df1 |>
        slice(seq(from = i, by = replications, length.out = n_1)) |>
        pull(abs_diff_opinion_1)
      # similarly for condition 2
      vals2 <- df2 |>
        slice(seq(from = i, by = replications, length.out = n_2)) |>
        pull(abs_diff_opinion_1)
      # run the defined test
      t.test(vals1, vals2, alternative = "two.sided")$p.value < alpha
    }
  )
  # proportion of rejections = estimated power
  mean(rejections)
}

We can now apply this function to estimate power for 10 neutral and 5 antipartisan worlds:
df_polarization <- calc_polarization(meta_results)
get_power_from_simulation(
  df = df_polarization,
  n_1 = 10,
  n_2 = 5,
  by_var = network,
  c_1 = "neutral",
  c_2 = "antipartisan",
  replications = 100
)

[1] 0.08
Not surprisingly, power is still low—around 8%. But now, let’s systematically explore how power increases as we raise the number of worlds per treatment:
sample_sizes <- seq(10, 40, by = 2)
df <- tibble(sample_size = sample_sizes) |>
  mutate(
    power = map_dbl(sample_size, ~
      get_power_from_simulation(
        df = df_polarization,
        n_1 = .x,
        n_2 = .x,
        by_var = network,
        c_1 = "neutral",
        c_2 = "antipartisan",
        replications = 100
      )
    )
  )
library(plotly)
plot_ly(df,
  x = ~sample_size,
  y = ~power,
  type = 'scatter',
  mode = 'lines+markers',
  hoverinfo = 'text',
  text = ~paste0('N = ', sample_size, '<br>Power = ', round(power, 2))
) %>%
  add_lines(
    x = ~sample_size,
    y = rep(0.8, nrow(df)),
    line = list(dash = 'dash', color = 'red'),
    name = '80% Power'
  ) %>%
  layout(
    title = 'Power vs. Sample Size',
    xaxis = list(title = 'Sample Size (N)'),
    yaxis = list(title = 'Power'),
    showlegend = TRUE
  )

We observe that even with a symmetric sample size of 18 worlds per condition, we achieve over 80% power. This is lower than our earlier approximation using analytical methods—likely because our original estimate was based on only 4 observations to calculate \(\sigma\), introducing variability.
We can inspect the effect size again for the complete run (40 worlds per condition, 100 replications):
cohen.d(
  formula = abs_diff_opinion_1 ~ network,
  data = df_polarization |>
    filter(network %in% c("neutral", "antipartisan")) |>
    droplevels(),
  pooled = TRUE,
  hedges.correction = FALSE
)
Cohen's d
d estimate: -0.7979546 (medium)
95 percent confidence interval:
lower upper
-0.8434983 -0.7524109
We see that this is not far off what we initially estimated.
This simulation-based approach gives us a robust, assumption-light estimate of the power under realistic experimental conditions. It’s particularly valuable when classical formulas are too simplistic for the design at hand.
More Complex Stuff
So far, we have considered our experiment as having the following structure:
\[ R \longrightarrow T_0 \longrightarrow Y_0 \\ R \longrightarrow T_1 \longrightarrow Y_1 \] In this setup, participants are randomized (\(R\)) into either a control group (\(T_0\), neutral) or a treatment group (\(T_1\), antipartisan), and we then measure the outcome—average partisan difference in opinions—after the treatment.
However, our actual experimental design contains more information:
\[ R \longrightarrow Y_0^{t_0} \longrightarrow T_0 \longrightarrow Y_0^{t_1} \\ R \longrightarrow Y_1^{t_0} \longrightarrow T_1 \longrightarrow Y_1^{t_1} \] That is, we also collect an initial measurement of opinions before treatment assignment, allowing us to observe polarization at baseline (\(Y^{t_0}\)), in addition to the post-treatment outcome (\(Y^{t_1}\)).
This pre-treatment measurement enables a more refined analysis. Instead of relying solely on a post-treatment comparison using a simple \(t\)-test, we can take advantage of the pre-treatment data and employ a regression framework that controls for baseline values. This not only increases statistical power, but also helps account for pre-existing variation in the outcome.
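A minimal sketch of such a baseline-adjusted regression, using toy data in place of the simulation output (the column names mirror those constructed in the next code block; the effect size and noise levels here are made up):

```r
# Baseline-adjusted treatment effect via linear regression (sketch).
# Toy data stand in for df_polarization; the point is the model formula:
# post-treatment outcome ~ treatment + baseline.
set.seed(42)
n <- 40
toy <- data.frame(
  network = rep(c("neutral", "antipartisan"), each = n / 2),
  abs_diff_opinion_1_init = rnorm(n, mean = 0.09, sd = 0.03)
)
# Outcome = baseline + (assumed) treatment effect + noise:
toy$abs_diff_opinion_1 <- toy$abs_diff_opinion_1_init +
  0.08 * (toy$network == "antipartisan") +
  rnorm(n, sd = 0.02)

fit <- lm(abs_diff_opinion_1 ~ network + abs_diff_opinion_1_init, data = toy)
coef(summary(fit))["networkneutral", ]  # adjusted treatment contrast
```

Because the baseline soaks up pre-existing variation, the residual noise shrinks, which is exactly why this design tends to need fewer worlds than the post-only t-test.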
df_polarization <- calc_polarization(meta_results) %>%
  bind_cols(
    calc_polarization(meta_results, table = "init") %>%
      rename(abs_diff_opinion_1_init = abs_diff_opinion_1) %>%
      select(abs_diff_opinion_1_init)
  )

🧩 Conclusion
Running a rigorous simulation with power analysis not only helps you design better experiments; it also supports transparent preregistration, strengthens scientific rigor, and enhances the credibility of grant applications—after all, well-visualized simulations make a compelling case.
References

Lehr, R. (1992). Sixteen S-squared over D-squared: A relation for crude sample size estimates. Statistics in Medicine, 11(8), 1099–1102.
Footnotes
Note that polarization, unlike an opinion, is not an individual property. Instead, it reflects a collective feature of the system. In agent-based modeling terms, polarization is a property of an entire world, not of any one agent.↩︎
In this interactive graph, \(\sigma\) does not only represent standard deviation in the classical sense, but can also serve as a proxy for sample size. Increasing the sample size reduces uncertainty around the estimated mean, effectively shrinking the observed standard deviation due to the law of large numbers. In that sense, decreasing \(\sigma\) here mimics the effect of increasing the number of observations.↩︎